motivating application
linear probes
training concept vectors
probes vs. concepts
Alain, G., & Bengio, Y. (2017). Understanding intermediate layers using linear classifier probes. OpenReview. https://openreview.net/forum?id=ryF7rTqgl
Schmalwasser, L., Penzel, N., Denzler, J., & Niebling, J. (2025). FastCAV: Efficient computation of concept activation vectors for explaining deep neural networks. In Proceedings of the 42nd International Conference on Machine Learning. https://openreview.net/forum?id=kRmfzTfIGe
Predict livability from aerial imagery.
Data. 51,781 grid cells (100 m²) in the Netherlands. Each has a livability score \(y \in \mathbf{R}\) and a 500×500 px aerial image \(x\).
Supplemental. FLAIR land-use labels from the French National Institute of Geographic and Forest Information (IGN).
Goal. Which visual concepts does \(f(x;\theta)\) use to predict \(y\)?
Concept labels \(c \in \{1,\ldots,K\}\) from auxiliary dataset.
\(h_l(x)\) — activation vector at layer \(l\)
\(p_l(h) = \text{softmax}(W_l h + b_l)\) — linear classifier predicting \(y\) from \(h_l\)
Diode constraint. \(\frac{\partial \mathcal{L}}{\partial \theta} = 0\) — gradients from the probe loss are blocked from the base model, so probe training never changes \(\theta\).
Minimize empirical risk on hidden representations: \[\min_{W_l, b_l} \sum_{i=1}^N \mathcal{L}(p_l(h_l(x_i)), y_i)\]
Validation. Evaluate the probe on held-out data to ensure it reads genuinely learned features rather than memorizing the training set.
Information vs. Accessibility. Total mutual information \(I(X; h_l)\) may decrease with depth (Data Processing Inequality), but linear accessibility of task-relevant information typically increases.

Early layers: generic features (edges). Deep layers: task-specific, linearly separable.
Bug detection via probes. A skip connection from layer 1→64 rendered layers 2–63 “dead” (probe error rates flat across them).
Write the pseudocode to extract activations and train a probe for a single layer.
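One possible answer, sketched in NumPy rather than pseudocode. The two-layer frozen base model, layer sizes, learning rate, and toy labelling rule are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen base model: two ReLU layers with fixed random weights.
# Diode constraint: these weights are never updated by probe training.
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(8, 8))

def h_l(x, layer):
    """Extract the activation vector at layer `layer` (1 or 2)."""
    a1 = np.maximum(x @ W1, 0.0)
    return a1 if layer == 1 else np.maximum(a1 @ W2, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy labelled data: class determined by a simple rule on the input.
X = rng.normal(size=(200, 16))
y = (X[:, 0] > 0).astype(int)            # K = 2 classes
H = h_l(X, layer=2)                      # activations of the probed layer

# Linear probe p_l(h) = softmax(W_l h + b_l), trained by gradient
# descent on mean cross-entropy; only W_l, b_l are updated.
K, d = 2, H.shape[1]
Wp, bp = np.zeros((d, K)), np.zeros(K)
Y = np.eye(K)[y]                         # one-hot targets
for _ in range(500):
    P = softmax(H @ Wp + bp)
    G = (P - Y) / len(H)                 # gradient w.r.t. the logits
    Wp -= 0.5 * H.T @ G
    bp -= 0.5 * G.sum(axis=0)

acc = (softmax(H @ Wp + bp).argmax(axis=1) == y).mean()
```

The probe's training accuracy measures how linearly accessible the label is at this layer; repeating the loop per layer gives the depth profile discussed above.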
CAV. \(v_c^l \in \mathbf{R}^{d_l}\) — direction of concept \(c\) in layer \(l\)
Concept Sensitivity. Directional derivative of class \(k\) logit along concept vector: \[S_{c,k}(x) = \nabla_{h} f_k(h_l(x)) \cdot v_c^l\]
TCAV Score. \(T_{c,k} = \frac{|\{x \in X_k : S_{c,k}(x) > 0\}|}{|X_k|}\)
Fraction of class-\(k\) images where concept \(c\) increases model confidence.
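Given per-image sensitivities, the TCAV score is just the fraction that are positive. A sketch with hypothetical sensitivity values:

```python
import numpy as np

def tcav_score(sensitivities):
    """T_{c,k}: fraction of class-k images with S_{c,k}(x) > 0."""
    s = np.asarray(sensitivities)
    return float((s > 0).mean())

# Hypothetical sensitivities of the "zebra" logit to the "stripes"
# CAV over four zebra images:
score = tcav_score([0.8, 0.1, -0.2, 0.5])   # → 0.75
```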
Step 1. Define concept images \(x_n \in D_c\) (positive set). Construct a control pool \(x_n' \in N\) (negative set, random images).
Concept: “stripes”
Step 2. Find direction \(v_c^l\) that best separates \(h_l(D_c)\) from \(h_l(N)\) in activation space.
Standard. Train linear SVM. CAV is the normal vector to decision boundary.
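A minimal sketch of Step 2, substituting a hand-rolled logistic-regression separator for the linear SVM to keep it dependency-free (for a linear classifier, the weight vector is likewise normal to the decision boundary). The toy activations and their shift direction are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

# Toy activations: concept set shifted along a known direction.
true_dir = np.zeros(d)
true_dir[0] = 1.0
H_concept = rng.normal(size=(100, d)) + 2.0 * true_dir   # h_l(D_c)
H_random  = rng.normal(size=(100, d))                    # h_l(N)

X = np.vstack([H_concept, H_random])
y = np.concatenate([np.ones(100), np.zeros(100)])

# Logistic regression by gradient descent (stand-in for a linear SVM);
# the learned weight vector is normal to the decision boundary.
w, b = np.zeros(d), 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    g = (p - y) / len(y)
    w -= X.T @ g
    b -= g.sum()

v_c = w / np.linalg.norm(w)   # unit-norm CAV
```

Here `v_c` recovers (approximately) the direction the concept set was shifted along.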
Step 3. Compute sensitivity of model’s prediction to this direction.
\[\text{Sensitivity} = \text{Model Gradient} \cdot \text{Concept Direction}\]
Global explanation: “To what extent does concept ‘stripes’ contribute to the model’s definition of ‘zebra’?”
\(\nabla_h f_k(h_l(x))\) is the steepest-ascent direction for class \(k\)’s logit.
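For a linear logit head \(f_k(h) = w_k \cdot h + b_k\) the gradient is simply \(w_k\), so the sensitivity reduces to a dot product. A sketch with illustrative numbers (the weight and concept vectors are made up):

```python
import numpy as np

# Linear logit head for class k: f_k(h) = w_k · h + b_k, so ∇_h f_k = w_k.
w_k = np.array([2.0, -1.0, 0.0])

# Unit concept vector v_c^l in the same activation space.
v_c = np.array([1.0, 0.0, 0.0])

# S_{c,k}(x) = ∇_h f_k(h_l(x)) · v_c^l  (independent of x for a linear head)
S = float(w_k @ v_c)   # → 2.0: moving along the concept raises the logit
```

For a nonlinear head the gradient depends on \(x\), which is why TCAV aggregates the sign of \(S_{c,k}(x)\) over many images.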
\(v_{\text{FastCAV}} = \frac{\bar{h}_c - \bar{h}_r}{\|\bar{h}_c - \bar{h}_r\|}\)
Equivalent to SVM under isotropic covariance (\(\Sigma = \sigma^2 I\)).
Leads to a 46.4× average speedup.
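The FastCAV direction is a few lines of NumPy — no classifier to train. A sketch on toy activations (shapes and the shift along dimension 0 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8

# Toy activations: concept examples shifted along dimension 0.
H_concept = rng.normal(size=(100, d))
H_concept[:, 0] += 2.0                   # h_l(D_c)
H_random = rng.normal(size=(100, d))     # h_l(N), random pool

# FastCAV: unit-normalized difference of class means.
diff = H_concept.mean(axis=0) - H_random.mean(axis=0)
v_fast = diff / np.linalg.norm(diff)
```

Under the isotropic-covariance assumption this mean-difference direction coincides with the SVM normal, which is where the speedup comes from.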
More details in next lecture.
Probe-specific
Concept-specific